By Yifan Ding
This dataset was collected from Guangzhou Women and Children's Medical Center, Guangzhou, and consists of chest X-ray images of children ages 1-5, some of whom had bacterial or viral pneumonia. The images were labeled by two different doctors, with a third acting as tiebreaker when the two disagreed. The original images come in different sizes, so we resized each image to a 160x240-pixel JPEG and converted them to grayscale. The following figure, from the last page of the Cell paper "Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning," shows the distinction between a normal lung, a lung with bacterial pneumonia, and a lung with viral pneumonia. https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
from IPython.display import Image
Image(filename = "/Users/xuechenli/Downloads/lessons/7324/lab2/Lung_Classification.png" )
The purpose of this dataset is to develop an artificial neural network that can distinguish between children with and without pneumonia, in order to assist doctors in making the right decision. There are three classifications: normal, pneumonia-bacterial, and pneumonia-viral. The focus of the Cell paper, "Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning," is transfer learning, a technique for training a neural network with a relatively small number of images (Kermany 1122). Our purpose is to classify the three aforementioned types for a preliminary screening. The children will still receive medical care and a second opinion from a doctor, so the stakes are not quite as high for our algorithm.
There are three classifications for our image data. The prediction task is to distinguish between children with a normal lung, a lung with bacterial pneumonia, and a lung with viral pneumonia. This algorithm would be used by hospitals that have an X-ray machine and serve children between the ages of 1 and 5. Though the data was collected from Chinese children, it can probably be used for children of other nationalities as well.
According to the Cell paper, data collected by the World Health Organization shows that pneumonia kills approximately 2 million children under 5 years old every year, with most of these deaths occurring in Southeast Asia and Africa (Kermany 1127). Since chest X-rays are common and can distinguish children with pneumonia from children without, X-rays were chosen as the method of choice. An accurate and quick classifier could be used wherever X-rays are taken; such an algorithm could be operated by nurses and would not require a doctor to analyze the chest X-ray images. This would save doctors' time and, potentially, the children who are suffering from pneumonia. The algorithm would screen children with pneumonia and direct them to the needed medical care: antibiotics if the child has bacterial pneumonia, supportive care if the child has viral pneumonia, and discharge if the child does not have pneumonia.
In order for our algorithm to be useful, it would have to improve in some way on the neural network already trained on this data. This could mean achieving an accuracy greater than their 93%, or an area under the ROC curve larger than theirs (Kermany 1127). The ROC curve measures the performance of a classifier by plotting the true positive rate against the false positive rate; the closer the curve hugs the upper-left corner, the better the classifier. The algorithm in the paper achieved the following ROC curve:
Image(filename = "/Users/xuechenli/Downloads/lessons/7324/lab2/ROC_Curve.png" )
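For reference, an ROC curve and its area can be computed with scikit-learn. The labels and scores below are made-up toy values for illustration, not results from this dataset:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical ground truth (1 = pneumonia) and classifier scores
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

# roc_curve sweeps the decision threshold and returns (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the curve; 1.0 is a perfect classifier
print(roc_auc)
```

For our three-class problem, one such curve per class (one-vs-rest) would be needed.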
In all classification problems, it is important to consider which is worse: false positives or false negatives. In this case, we define a false positive as the algorithm predicting that a child has pneumonia when he or she doesn't, and a false negative as the classifier predicting that the child does not have pneumonia when the child does. Clearly, we want to limit the number of false negatives, even at the cost of more false positives. With a false positive, all that happens is that the child goes under more supervised care, and the absence of pneumonia will likely be discovered with time. With a false negative, though, the child may leave the hospital even though he or she has pneumonia. We will therefore aim for more false positives than false negatives.
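To make the tradeoff concrete, false positives and false negatives can be read off a confusion matrix; the binary labels below are hypothetical, not from our data:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 1 = pneumonia, 0 = normal
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]

# With labels sorted as [0, 1], ravel() yields (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives:", fp)  # predicted pneumonia, child is healthy
print("false negatives:", fn)  # predicted healthy, child has pneumonia (the costly case)
```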
Kermany, D., Goldbaum, M., Cai, W., Valentim, C., Liang, H., Baxter, S., McKeown, A., Yang, G., Wu, X., Yan, F., Dong, J., Prasadha, M., Pei, J., Ting, M., Zhu, J., Li, C., Hewett, S., Dong, J., Ziyar, I., Shi, A., Zhang, R., Zheng, L., Hou, R., Shi, W., Fu, X., Duan, Y., Huu, V., Wen, C., Zhang, E., Zhang, C., Li, O., Wang, X., Singer, M., Sun, X., Xu, J., Tafreshi, A., Lewis, M., Xia, H. and Zhang, K. (2018). Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell, 172(5), pp.1122-1131.e9.
Kaggle Dataset: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
!pip install opencv-python
import numpy as np
import pandas as pd
import os, sys
import cv2
from tqdm import tqdm
import skimage
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import matplotlib.pyplot as plt
folder1 = "/Users/yifan/Desktop/train/NORMAL/"
normal = [f for f in os.listdir(folder1) if os.path.isfile(os.path.join(folder1, f))]
folder2 = "/Users/yifan/Desktop/train/PNEUMONIA/"
pneumonia = [f for f in os.listdir(folder2) if os.path.isfile(os.path.join(folder2, f))]
folder = "/Users/yifan/Desktop/train/"
chest = normal + pneumonia
print("Working with {0} images".format(len(chest)))
We displayed some images from both groups.
from IPython.display import display
from IPython.display import Image
def images_plot(images, range_low, range_high):
    if range_low in range(len(normal)) and range_high in range(len(normal)) and range_low < range_high:
        for i in range(range_low, range_high):
            print(images[i])
            display(Image(filename=folder1 + "/" + images[i], width=160, height=240))
    elif range_low in range(len(normal) + 1, len(chest)) and range_high in range(len(normal) + 1, len(chest)) and range_low < range_high:
        for i in range(range_low, range_high):
            print(images[i])
            display(Image(filename=folder2 + "/" + images[i], width=160, height=240))
    else:
        print("Error: Out of Range")
images_plot(chest,2364,2367)
images_plot(chest,123,126)
From the images of normal and pneumonia chests, we can hardly tell the difference just by sight, which indicates that further analysis of the images is essential for this case.
#Code from https://github.com/deadskull7/Pneumonia-Diagnosis-using-XRays-96-percent-Recall/blob/master/Pneumonia%20Diagnosis%20using%20Lung's%20XRay%20.ipynb
def read_data(folder):
    images = []
    labels = []  # true status
    for dirc in os.listdir(folder):
        readin = folder + dirc
        if not dirc.startswith('.'):
            if dirc in ['NORMAL']:
                for image_name in tqdm(os.listdir(readin)):
                    label = 'normal'
                    labels.append(label)
            elif dirc in ['PNEUMONIA']:
                for image_name in tqdm(os.listdir(readin)):
                    if 'bacteria' in str(image_name):
                        label = 'bacteria'
                    elif 'virus' in str(image_name):
                        label = 'virus'
                    labels.append(label)
            for image_name in tqdm(os.listdir(readin)):
                img = cv2.imread(readin + '/' + image_name)  # read in images from folder
                if img is not None:
                    img = skimage.transform.resize(img, (160, 240, 3))  # resize each image to 160x240
                    img = np.asarray(img)  # turn each image into an array
                    img = ((img / 255.) - .5) * 2  # standardize to [-1, 1]
                    images.append(img)
    images = np.asarray(images)
    labels = np.asarray(labels)
    return images, labels
chest_images, chest_ture = read_data(folder)
print(chest_images.shape)
print(chest_ture.shape)
print(chest_ture)
i = 0
j = 0
k = 0
for label in chest_ture:
    if label == 'bacteria':
        i = i + 1
    elif label == 'virus':
        j = j + 1
    else:
        k = k + 1
print('bacteria:', i,
      'virus:', j,
      'normal:', k)
import matplotlib.pyplot as plt
# Data to plot
labels = 'bacteria','virus','normal'
sizes = [2530,1345,1341]
colors = ['gold', 'yellowgreen', 'lightcoral']
# Plot
plt.pie(sizes,labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.show()
Luminance is by far more important than chrominance for distinguishing visual features. A good way to illustrate this property is to take a given image and separate the luminance plane from the chrominance planes. We use the standard luma weights 0.3*R + 0.59*G + 0.11*B to convert all the images to grayscale; for 8-bit images, the resulting grayscale values range from 0 to 255.
def gray_scale(data):
    '''
    input: a np.array of images in RGB format
    output: a np.array of images in grayscale format
    '''
    n_images = data.shape[0]
    n_rows = data.shape[1]
    n_columns = data.shape[2]
    grayscale = np.zeros((n_images, n_rows, n_columns, 1))
    for idx in range(n_images):
        # weighted sum of the R, G, B channels using the luma weights above
        grayscale[idx, :, :, 0] = (0.3 * data[idx, :, :, 0]
                                   + 0.59 * data[idx, :, :, 1]
                                   + 0.11 * data[idx, :, :, 2])
    return grayscale
chest_gray = gray_scale(chest_images)
print(chest_gray.shape)
def linearize(data):
    '''
    input: a np.array of images
    output: a 2-D np.array (one 1-D feature vector per row)
    '''
    num_images = data.shape[0]
    num_columns = int(np.prod(data.shape) / num_images)
    linear = np.reshape(data, (num_images, num_columns))
    return linear
chest_gray_linear = linearize(chest_gray)
chest_rgb_linear = linearize(chest_images)
print(chest_gray_linear.shape)
Each column in the dataset represents a specific pixel, and each row represents one image, for a total of 5216 images. Each value in the dataset represents the grayscale intensity of that pixel. Our target is a three-class output: normal, bacteria, and virus.
Raschka, S. and Mirjalili, V. (n.d.). Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow. 2nd ed. Packt Publishing.
from sklearn.decomposition import PCA
import seaborn as sns
from sklearn.decomposition import IncrementalPCA
from sklearn.decomposition import KernelPCA
from sklearn.decomposition import SparsePCA
def full_pca(data, n):
    '''
    input: data, n_components
    output: fitted PCA of data
    '''
    chest_pca = PCA(n_components=n)
    return chest_pca.fit(data)
chest_pca = full_pca(chest_gray_linear,5216)
def explain_variance(pca):
    '''
    input: pca
    output: explained variance ratio per component
    '''
    explained_var = pca.explained_variance_ratio_
    return explained_var

def cumu_variance(pca):
    '''
    input: pca
    output: cumulative explained variance
    '''
    cumu_var_exp = np.cumsum(pca.explained_variance_ratio_)
    return cumu_var_exp
cumu_var = cumu_variance(chest_pca)
explained_var = explain_variance(chest_pca)
print(cumu_var)
print(explained_var)
#!pip install cufflinks plotly
#The plot_explained_variance function is adapted from Eric's 04. Dimension Reduction and Images
from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
import plotly
def plot_explained_variance(var1, var2):
    plotly.offline.iplot({
        "data": [Scatter(y=var1, name='Explained variance'),
                 Scatter(y=var2, name='Cumulative explained variance')],
        "layout": Layout(xaxis=XAxis(title='Principal components'),
                         yaxis=YAxis(title='Explained variance ratio'))
    })
plot_explained_variance(explained_var,cumu_var)
Retaining 40 components gives an explained variance ratio of 0.8, and retaining 200 components gives a ratio of 0.9. Since 200 sits at the knee of the curve, it is appropriate to use the first 200 components to represent the chest images.
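As a side note, scikit-learn's PCA can pick the number of components for a target variance ratio directly when given a float in (0, 1); the sketch below uses random correlated data as a stand-in for the chest images:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(300, 50) @ rng.randn(50, 50)  # toy data with correlated features

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that ratio
pca = PCA(n_components=0.90).fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```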
chest_pca_first200 = full_pca(chest_gray_linear,200)
cumu_var = cumu_variance(chest_pca_first200)
explained_var = explain_variance(chest_pca_first200)
plot_explained_variance(explained_var,cumu_var)
eigen_chest = chest_pca_first200.components_.reshape(200,160,240)
eigen_chest.shape
#The plot_gallery function is from Eric's 04. Dimension Reduction and Images
import matplotlib.pyplot as plt  # a helper plotting function
def plot_gallery(images, titles, h, w, n_row=4, n_col=6):
    """
    input: image matrix
    output: image gallery
    """
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
eigenlabels = ['eigenimage ' + str(i) for i in range(eigen_chest.shape[0])]
plot_gallery(eigen_chest,eigenlabels,160,240)
The components represent our images well. The first component reflects the average of the 5216 images.
# reconstruct_image is adapted from Eric's 04. Dimension Reduction and Images
''' original from Eric
def reconstruct_image(trans_obj, org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image

idx_to_reconstruct = 131
chest_gray_linear_idx = chest_gray_linear[idx_to_reconstruct]
low_dim, reconstructed_image = reconstruct_image(chest_pca_first200, chest_gray_linear_idx.reshape(1, -1))
'''
def reconstruct_image(trans_obj, pca_features, idx):
    '''
    input: trans_obj, pca_features, idx
    output: low-dimensional representation and reconstruction of the chosen image
    '''
    low_dim = trans_obj.transform(pca_features[idx].reshape(1, -1))
    rec_image = trans_obj.inverse_transform(low_dim)
    return low_dim, rec_image
chest_gray_linear.shape
chest_pca_first200.components_.shape
#Make a comparison here
#Take the 100th image as an example
low_dim, rec_image = reconstruct_image(chest_pca_first200,chest_gray_linear,100)
plt.subplot(1,2,1)
plt.imshow(chest_gray_linear[100].reshape(160,240), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(rec_image.reshape(160,240), cmap=plt.cm.gray)
plt.title('PCA reconstruction (200 components)')
plt.grid(False)
As a check on our code, we also computed radial basis (RBF) kernel PCA as a negative case, expecting it to be the worst of the PCA variants based on the image analysis; indeed, it reached only about 72% accuracy. Comparing PCA and kernel PCA with the first 200 components, plain PCA and polynomial kernel PCA do best with more than 95% accuracy, while RBF kernel PCA does worst at roughly 72-75% accuracy.
Above all, we would use the full PCA with first 200 components to be the final dataset that is used for classification.
# Data to plot
labels = 'bacteria','virus','normal'
sizes = [2530,1345,1341]
colors = ['gold', 'yellowgreen', 'lightcoral']
# Plot
plt.pie(sizes,labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.show()
As you can see above, we have 1341 normal samples, 2530 pneumonia-bacterial samples, and 1345 pneumonia-viral samples; the proportions differ considerably across the three classes. We therefore want to keep the same proportion of each class in every fold, so that each fold is a good representative of the whole dataset. Another issue is sample size: the normal and pneumonia-viral classes are small. After an 80/20 train/test split, only about 1341/5 ≈ 268 normal samples land in the test set, which is not many, so performance on a single such split could vary considerably by chance. In k-fold cross-validation, the data is divided into k subsets.
The holdout method is then repeated k times: each time, one of the k subsets serves as the validation set and the other k-1 subsets together form the training set. The error estimate is averaged over all k trials to measure the overall effectiveness of the model. Every data point appears in a validation set exactly once and in a training set k-1 times. This reduces bias, since most of the data is used for fitting, and also reduces variance, since all of the data eventually appears in a validation set. As a general rule backed by empirical evidence, k = 5 or 10 is preferred; the optimal k is often reported to be between 5 and 10, because statistical performance does not improve much for larger values of k. For our problem, 5- or 10-fold cross-validation with an 80/20 split is therefore a good approach, and since we also want the same class proportions in each fold, stratified 10-fold is the better choice.
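The stratification property described above can be verified on a small imbalanced toy label vector; every fold keeps the same class mix as the whole set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 80 of class 'a', 20 of class 'b'
y = np.array(['a'] * 80 + ['b'] * 20)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    # each 20-sample test fold holds 16 'a' and 4 'b', matching the 80/20 mix
    print(np.sum(y[test_idx] == 'b'), "of", len(test_idx))
```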
Arlot, Sylvain, and Alain Celisse. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys, The Author, under a Creative Commons Attribution License, https://projecteuclid.org/download/pdfview_1/euclid.ssu/1268143839.
Gupta, Prashant. “Cross-Validation in Machine Learning.” Medium, Towards Data Science, 5 June 2017, https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f.
Shulga, Dima. “5 Reasons Why You Should Use Cross-Validation in Your Data Science Projects.” Medium, Towards Data Science, 27 Sept. 2018, https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79.
We evaluate our metric with 10-fold stratified cross-validation.
from sklearn.model_selection import StratifiedKFold,cross_val_score
from sklearn import metrics
chest_skf = StratifiedKFold(n_splits=10)
X_train = []
X_test = []
y_train = []
y_test = []
for train_index, test_index in chest_skf.split(chest_gray_linear, chest_ture):
    print("Train Index:", train_index, ", Test Index:", test_index)
    X_train_temp, X_test_temp = chest_gray_linear[train_index], chest_gray_linear[test_index]
    y_train_temp, y_test_temp = chest_ture[train_index], chest_ture[test_index]
    X_train.append(X_train_temp)
    X_test.append(X_test_temp)
    y_train.append(y_train_temp)
    y_test.append(y_test_temp)
print(len(X_train[0]))
print(len(X_test[0]))
print(len(y_train[0]))
print(len(y_test[0]))
print(X_train[0].shape)
As we mentioned in the False Positive vs. False Negative Trade-off section, it is important in all classification problems to consider which is worse: false positives or false negatives. Here, a false positive means the algorithm predicts that a child has pneumonia when he or she does not, and a false negative means the classifier predicts that the child does not have pneumonia when the child does. Clearly we want to limit the number of false negatives, but we also want to spare children unnecessary treatment, which is what happens in the false-positive case.
Since a higher recall reflects fewer false negatives, and we also care about keeping false positives low, we use the F1-score as our main metric. The F1-score is defined for binary classes; there are two common ways to extend it to multiple classes. Micro-averaging is computed from the pooled TPs, TNs, FPs, and FNs of the confusion matrix, weighting each instance equally. Macro-averaging averages the per-class scores, weighting each class equally to evaluate overall performance. Since our classes are imbalanced, we prefer the support-weighted macro-average F1 score.
Given the per-class metric $X_k$ over $k$ classes: $$F1_{micro} = \frac {2\times (TP_1 + ... + TP_k)} {2\times (TP_1 + ... + TP_k) + FP_1 + ... + FP_k + FN_1 + ... + FN_k} $$
$$F1_{macro} = \frac {X_1 + ... + X_k} {k} $$
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix, precision_score, accuracy_score, f1_score
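The difference between the averaging schemes can be seen on a small hypothetical three-class example (labels made up for illustration):

```python
from sklearn.metrics import f1_score

# Hypothetical predictions over our three classes
y_true = ['normal', 'bacteria', 'virus', 'bacteria', 'bacteria', 'normal']
y_pred = ['normal', 'bacteria', 'bacteria', 'bacteria', 'virus', 'normal']

# micro pools all TP/FP/FN counts; macro averages per-class F1 equally;
# weighted averages per-class F1 by class support
for avg in ('micro', 'macro', 'weighted'):
    print(avg, f1_score(y_true, y_pred, average=avg))
```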
def average_magnitude(gradient):
    '''
    Input: gradient
    Output: average magnitude
    '''
    absolute = np.abs(gradient)
    return np.mean(absolute)
# Adapted from Eric's 07. MLP Neural Networks
import numpy as np
from scipy.special import expit
import sys
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin

# start with a simple base classifier, which can't be fit or predicted
# it only has internal classes to be used by classes that will subclass it
class TwoLayerPerceptronBase(BaseEstimator, ClassifierMixin):
    def __init__(self, n_hidden=30, phi_function='sigmoid', cost_function='quadratic',
                 n_layers=2, C=0.0, epochs=500, eta=0.001, random_state=None):
        np.random.seed(random_state)
        self.n_hidden = n_hidden
        self.l2_C = C
        self.epochs = epochs
        self.eta = eta
        self.phi_function = phi_function
        self.cost_function = cost_function
        self.n_layers = n_layers

    @staticmethod
    def _encode_labels(y):
        """Encode labels into one-hot representation"""
        onehot = pd.get_dummies(y).values.T
        return onehot

    def _initialize_weights(self):
        """Initialize weights with small random numbers."""
        W = []
        for i in range(self.n_layers):
            if 0 == i:
                # first layer: d input dimensions for the first weights
                size = (self.n_hidden, self.n_features_ + 1)
            elif self.n_layers - 1 == i:
                # last layer: stop at n_layers-1
                size = (self.n_output_, self.n_hidden + 1)
            else:
                size = (self.n_hidden, self.n_hidden + 1)
            W.append(np.random.uniform(-1, 1, size))
        return W

    @staticmethod
    def _sigmoid(z):
        """Use scipy.special.expit to avoid overflow"""
        # 1.0 / (1.0 + np.exp(-z))
        return expit(z)

    @staticmethod
    def _linear(z):
        return z

    @staticmethod
    def _add_bias_unit(X, how='column'):
        """Add bias unit (column or row of 1s) to array at index 0"""
        if how == 'column':
            ones = np.ones((X.shape[0], 1))
            X_new = np.hstack((ones, X))
        elif how == 'row':
            ones = np.ones((1, X.shape[1]))
            X_new = np.vstack((ones, X))
        return X_new

    @staticmethod
    def _L2_reg(lambda_, W):
        """Compute L2-regularization cost"""
        # only compute for non-bias terms
        k = 0.
        for i in range(len(W)):
            k += np.mean(W[i][:, 1:] ** 2)
        return (lambda_ / 2.0) * np.sqrt(k)

    def _cost(self, A_last, Y_enc, W):
        '''Get the objective function value'''
        if 'quadratic' == self.cost_function:
            cost = np.mean((Y_enc - A_last) ** 2)
        elif 'crossentropy' == self.cost_function:
            cost = -np.mean(Y_enc * np.log(A_last) + (1 - Y_enc) * np.log(1 - A_last))
        L2_term = self._L2_reg(self.l2_C, W)
        return cost + L2_term
class TwoLayerPerceptron(TwoLayerPerceptronBase):
    def _feedforward(self, X, W):
        """Compute feedforward step"""
        if 'linear' == self.phi_function:
            activate = self._linear
        elif 'sigmoid' == self.phi_function:
            activate = self._sigmoid
        else:
            raise ValueError('Not a valid phi_function')
        # initialize A, Z
        A, Z = [], []
        A.append(self._add_bias_unit(X.T, how='row'))
        Z.append(W[0] @ A[0])
        for i in range(1, self.n_layers):
            A.append(self._add_bias_unit(activate(Z[i - 1]), how='row'))
            Z.append(W[i] @ A[i])
        # last layer A
        A.append(activate(Z[-1]))
        return A, Z

    def _get_gradient(self, A, Z, Y_enc, W):
        """Compute gradient step using backpropagation."""
        # vectorized backpropagation
        grad, V = [[]] * self.n_layers, [[]] * self.n_layers
        # last layer
        if 'crossentropy' == self.cost_function:
            V[-1] = A[-1] - Y_enc
        elif 'quadratic' == self.cost_function:  # mse
            V[-1] = -2 * (Y_enc - A[-1]) * A[-1] * (1 - A[-1])
        else:
            raise ValueError('Not a valid cost_function')
        grad[-1] = V[-1] @ A[-2].T
        # earlier layers
        for i in range(self.n_layers - 2, -1, -1):  # start at the last hidden layer, down to 0
            if 'linear' == self.phi_function:
                V[i] = W[i + 1].T @ V[i + 1]
            elif 'sigmoid' == self.phi_function:
                V[i] = A[i + 1] * (1 - A[i + 1]) * (W[i + 1].T @ V[i + 1])
            V[i] = V[i][1:, :]
            grad[i] = V[i] @ A[i].T
        # regularize weights that are not bias terms
        for i in range(len(grad)):
            grad[i][:, 1:] += W[i][:, 1:] * self.l2_C
        return grad

    def predict(self, X):
        """Predict class labels"""
        A, _ = self._feedforward(X, self.W)
        y_pred = np.argmax(A[-1], axis=0)
        return [self.class_labels_[i] for i in y_pred]  # map indices back to labels

    def fit(self, X, y, print_progress=False):
        """Learn weights from training data."""
        X_data, y_data = X.copy(), y.copy()
        self.class_labels_ = np.unique(y)
        Y_enc = self._encode_labels(y)
        # init weights and setup matrices
        self.n_features_ = X_data.shape[1]
        self.n_output_ = Y_enc.shape[0]
        self.W = self._initialize_weights()
        self.magnitude_gradients = []
        for i in range(self.n_layers):
            self.magnitude_gradients.append(np.zeros([self.epochs, 1]))
        self.cost_ = []
        self.score_ = []
        # note: a list comprehension is needed here; [[]] * n would alias one list
        gradients = [[] for _ in range(self.n_layers)]
        for epoch in range(self.epochs):
            if print_progress > 0 and (epoch + 1) % print_progress == 0:
                sys.stderr.write('\rEpoch: %d/%d' % (epoch + 1, self.epochs))
                sys.stderr.flush()
            # feedforward all instances
            A, Z = self._feedforward(X_data, self.W)
            cost = self._cost(A[-1], Y_enc, self.W)
            self.cost_.append(cost)
            self.score_.append(f1_score(y, self.predict(X_data), average='weighted'))
            # compute gradient via backpropagation
            grad = self._get_gradient(A, Z, Y_enc, self.W)
            for grad_idx in range(len(grad)):
                self.W[grad_idx] -= self.eta * grad[grad_idx]
                gradients[grad_idx].append(np.mean(grad[grad_idx]))
            for layer in range(self.n_layers):
                self.magnitude_gradients[layer][epoch] = average_magnitude(grad[layer])
        self.gradients_ = [np.array(g) for g in gradients]
        return self
def my_para(phi_function, cost_function, n_layer):
    my_params = dict(n_hidden=30,
                     C=0.01,
                     epochs=200,  # increase for a longer run
                     eta=0.01,
                     phi_function=phi_function,  # supported: sigmoid, linear
                     cost_function=cost_function,  # supported: quadratic, crossentropy
                     n_layers=n_layer)
    return my_params
cost_functions = ['quadratic', 'crossentropy']
phi_functions = ['sigmoid', 'linear']
n_layers = [2,3,5]
We tried to use scikit-learn's GridSearchCV, but could not get it to work, so we used nested for loops to tune the hyperparameters of our model instead. (The following commented-out code is our GridSearchCV attempt, which we were unable to debug.)
#from sklearn.model_selection import cross_val_score
# scores = cross_val_score(estimator=nn_pipe,
# X=X_train,
# y=y_train,
# cv=10,
# n_jobs=-1,
# scoring = 'f1_macro')
#print('CV f1 scores: %s' % scores)
#print('CV f1 score: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
# n_hidden = [30]
# C = [0.01]
# epochs = [200]
# eta = [0.01, 0.001]
# cost_functions = ['quadratic', 'crossentropy']
# phi_functions = ['sigmoid', 'linear']
# n_layers = [2,3,5]
# random_state=[1]
# param_grid = {'nn__n_hidden':[30], 'nn__C': [0.01], 'nn__epochs': [1], 'nn__eta':[0.01,0.001],
# 'nn__phi_function': ['sigmoid', 'linear'], 'nn__cost_function':['quadratic', 'crossentropy'],
# 'nn__n_layers': [2,3,5],'nn__random_state': [1]}
# nn_pipe_verbose = Pipeline([('pca', PCA(n_components=200)),
# ('scl',StandardScaler()),
# ('nn', TwoLayerPerceptron())])
# print(nn_pipe_verbose.get_params().keys())
# from sklearn.model_selection import GridSearchCV
# gs = GridSearchCV(estimator = nn_pipe_verbose,
# param_grid = param_grid,
# scoring = 'f1_macro',
# cv = 2)
# gs.fit(X_train_all, y_train_all)
# scores = cross_val_score(gs, X_train_all, y_train_all, scoring = 'f1_macro', cv = 10, n_jobs=-1)
# print('CV f1 score: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
# print(scores.report())
# scores.boxplot_parameters(display_train = False)
After using cross-validation to tune the parameters, we used the test set to evaluate the model's performance.
# Use our own TwoLayerPerceptron on the grayscaled chest data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
f1_all = np.zeros([2, 2, 3, 10])
for cost_idx, cost_function in enumerate(cost_functions):
    for phi_idx, phi_function in enumerate(phi_functions):
        for n_idx, n_layer in enumerate(n_layers):
            my_params = my_para(phi_function, cost_function, n_layer)
            my_mlp = TwoLayerPerceptron(**my_params)
            my_f1_lst = []
            for idx, _ in enumerate(X_train):
                # hold out 25% of this fold's training data for validation
                X_train_final, X_validate_final, y_train_final, y_validate_final \
                    = train_test_split(X_train[idx], y_train[idx], test_size=0.25)
                # PCA: fit the first 200 components on the training split only
                pca = PCA(n_components=200)
                pca_model = pca.fit(X_train_final)
                X_train_temp = pca_model.transform(X_train_final)
                X_cv_temp = pca_model.transform(X_validate_final)
                # normalize using statistics from the training split
                scaler = StandardScaler()
                st_scaler = scaler.fit(X_train_temp)
                X_train_temp = st_scaler.transform(X_train_temp)
                X_cv_temp = st_scaler.transform(X_cv_temp)
                my_mlp.fit(X_train_temp, y_train_final)
                my_yhat = my_mlp.predict(X_cv_temp)
                my_f1_score = f1_score(y_validate_final, my_yhat, average='weighted')
                my_f1_lst.append(my_f1_score)
                f1_all[cost_idx, phi_idx, n_idx, idx] = my_f1_score
            f1_ave = np.mean(my_f1_lst)
            print('cost:', cost_function, 'phi:', phi_function, 'layer:', n_layer, 'f1_score:', f1_ave)
print(f1_all)
The best F1 score was 31.37%, obtained with crossentropy as the cost function, sigmoid as the phi function, and 2 layers.
cost_functions = ['quadratic', 'crossentropy']
phi_functions = ['sigmoid', 'linear']
n_layers = [2, 3, 5]
m = 0
plt.figure(figsize=(15, 15))
for cost_idx, cost_function in enumerate(cost_functions):
    for phi_idx, phi_function in enumerate(phi_functions):
        for n_idx, n_layer in enumerate(n_layers):
            m = m + 1
            plt.subplot(4, 3, m)
            plt.boxplot(f1_all[cost_idx, phi_idx, n_idx, :])
            plt.title('cost: ' + str(cost_function) + ' phi: ' + str(phi_function) + ' n_layer: ' + str(n_layer))
The boxplots above confirm that 'cost: crossentropy, phi: sigmoid, n_layer: 2' has the highest F1 score in our grid search, so this is our best model.
# Using a scikit-learn Pipeline to combine PCA, standardization, and TwoLayerPerceptron
from sklearn.preprocessing import StandardScaler
n_components = 200
nn_pipe = Pipeline([('pca', PCA(n_components=n_components)),
                    ('scl', StandardScaler()),
                    ('nn', TwoLayerPerceptron())])
from sklearn.model_selection import train_test_split
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(chest_gray_linear,chest_ture,test_size = 0.2)
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
kfold = StratifiedKFold(n_splits=10,
                        shuffle=True,  # random_state requires shuffling
                        random_state=1).split(X_train_all, y_train_all)
my_params = my_para('sigmoid', 'crossentropy',2)
my_mlp = TwoLayerPerceptron(**my_params)
scores = []
for k, (train, test) in enumerate(kfold):
    nn_pipe = Pipeline([('pca', PCA(n_components=n_components)),
                        ('scl', StandardScaler()),
                        ('nn', my_mlp)])
    nn_pipe.fit(X_train_all[train], y_train_all[train])
    yhat = nn_pipe.predict(X_train_all[test])
    score = f1_score(y_train_all[test], yhat, average='weighted')  # (y_true, y_pred)
    scores.append(score)
    print('Fold: %s, F1_score: %.3f' % (k + 1, score))
print('\n F1_score: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
## Our Best version:
## Using test dataset to verify model performance
phi_function = 'sigmoid'
cost_function = 'crossentropy'
n_layer = 2
my_params = my_para(phi_function,cost_function,n_layer) #epoch=200
my_mlp = TwoLayerPerceptron(**my_params)
# PCA: fit the first 200 components on each fold's training set
pca = PCA(n_components=200)
for i in range(10):
    pca_model = pca.fit(X_train[i])
    X_train_temp = pca_model.transform(X_train[i])
    X_test_temp = pca_model.transform(X_test[i])
    # normalize
    scaler = StandardScaler()
    st_scaler = scaler.fit(X_train_temp)
    X_train_temp = st_scaler.transform(X_train_temp)
    X_test_temp = st_scaler.transform(X_test_temp)
    my_mlp.fit(X_train_temp, y_train[i])
    my_yhat = my_mlp.predict(X_test_temp)
    y_test_temp = y_test[i]
    my_f1_score = f1_score(y_test_temp, my_yhat, average='weighted')
    print('fold' + str(i + 1), my_f1_score)
# lets load up the handwritten digit dataset
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np
ds = load_digits()
X = ds.data/16.0-0.5 # normalize the data
y = ds.target
print(X.shape)
print(y.shape)
print(np.min(X),np.max(X))
print(np.unique(y))
# reshape and print a few of the images in the digits dataset
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True)
ax = ax.flatten()
for i in range(10):
    img = X[i].reshape(8, 8)
    ax[i].imshow(img, cmap='Greys', interpolation='nearest')
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.show()
clf = TwoLayerPerceptron(n_hidden=10, epochs=1500, eta=0.001)
clf.fit(X,y)
from sklearn.metrics import accuracy_score
yhat = clf.predict(X)
accuracy_score(y,yhat)
# digits data
ax = plt.subplot(1, 1, 1)
for w in range(clf.n_layers):  # clf was fit with the default n_layers=2
    plt.plot(clf.magnitude_gradients[w][10:], label='g' + str(w + 1))
plt.legend()
plt.ylabel('Average gradient magnitude')
plt.xlabel('Iteration')
plt.show()
Comparing the accuracy and the gradient-magnitude plot above, our result is consistent with what we got in ICA3.
# Our chest data
ax = plt.subplot(1, 1, 1)
for w in range(n_layer):
    plt.plot(my_mlp.magnitude_gradients[w][10:], label='g' + str(w + 1))
plt.legend()
plt.ylabel('Average gradient magnitude')
plt.xlabel('Iteration')
plt.show()
Because the last layer is directly connected to the output and is therefore the most sensitive, g2 shows the highest values.
# To test our model, we visualized the gradients when using 5 layers.
Image(filename = "/Users/xuechenli/Downloads/lessons/7324/lab4/5layer.png" )
from sklearn.neural_network import MLPClassifier
def mlp_sk(activation, max_iter):
    mlp = MLPClassifier(activation=activation,
                        alpha=0.01,
                        solver='lbfgs',
                        early_stopping=False,
                        hidden_layer_sizes=30,
                        max_iter=max_iter,
                        random_state=1,
                        shuffle=True)
    return mlp
active = ['identity', 'logistic']
max_iter = 200
for activation in active:
    sk_mlp = mlp_sk(activation, max_iter)
    sk_mlp.fit(X_train_all, y_train_all)
    sk_yhat = sk_mlp.predict(X_test_all)
    sk_f1 = f1_score(y_test_all, sk_yhat, average='weighted')  # (y_true, y_pred)
    print('activation:', activation,
          'Scikit-Learn f1_score:', sk_f1)
Compared with scikit-learn, we obtain a similar F1 score, which suggests that our implementation is correct.
In this lab, we built an MLP model for the chest image dataset. We used the F1 score as the evaluation criterion and implemented 10-fold stratified cross-validation. Our model's performance is consistent with scikit-learn's.